STATS 32 Session 10: A Crash Course in Statistics and Modeling

Kenneth Tay

Oct 24, 2019

Announcements

Project due on 2 Nov (Sat) 23:59:59

Remaining office hours:

25 Oct (tomorrow)
1 Nov (next Friday)

10am-12pm, Sequoia Hall Rm 105

Recap of session 9

Joining datasets with left_join
- df1 %>% left_join(df2, by = "key")
Making maps in R
- geom_polygon()
- aes(x = long, y = lat, group = group)

Agenda for today

A crash course in statistics and modeling
- Hypothesis testing
- Linear regression

A very high-level picture only: for technical details, take STATS 60/STATS 101

Recall: Lists

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

Extracting parts of a list

Use [[ or $ notation to refer to a specific key-value pair

cars$make         # no quotation marks

## [1] "Honda"

cars[["models"]]  # remember quotation marks!

## [1] "Fit"     "CR-V"    "Odyssey"

Recall: Data frames are lists!

To R, a data frame is simply a special type of list!
- Keys of the list are the variable/covariate names
- Values are vectors of the same length
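
A quick way to see this in the console. This is a minimal sketch; the data frame `songs` and its values are made up for illustration and are not the course dataset.

```r
# Hypothetical data frame, for illustration only
songs <- data.frame(
  name  = c("Song A", "Song B", "Song C"),
  tempo = c(120, 95, 140)
)

is.list(songs)      # a data frame is a list under the hood
names(songs)        # keys of the list are the column names
songs$tempo         # list-style extraction with $
songs[["tempo"]]    # the same column, extracted with [[ ]]
lengths(songs)      # every column (value) has the same length
```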

Today’s dataset: Top 100 songs on Spotify

(Source: Spotify)

Tempo by mode: Is there a difference?

Hard to tell from the histograms:

Look at mean tempo for each mode

Major: \(\approx 122\) bpm
Minor: \(\approx 116\) bpm
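
A sketch of how per-group means like these can be computed. The six tempo values below are made up (chosen so the means come out to 122 and 116); they are not the actual Spotify data, and the column names `mode` and `tempo` are assumptions.

```r
# Made-up values for illustration; not the actual Spotify data
songs <- data.frame(
  mode  = c("Major", "Major", "Major", "Minor", "Minor", "Minor"),
  tempo = c(125, 118, 123, 110, 120, 118)
)

# Mean tempo within each mode, using base R
mean_tempo <- tapply(songs$tempo, songs$mode, mean)
mean_tempo

# With dplyr loaded, an equivalent would be:
# songs %>% group_by(mode) %>% summarize(mean_tempo = mean(tempo))
```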

Is this difference significant? What do we mean by significance anyway?

Structure of a hypothesis test

1. Start with a null hypothesis: an assumption about how the data is generated
2. Based on this assumption, how likely were we to collect data as extreme as what we have?
- p-value: probability of collecting data at least as extreme as ours (if the null hypothesis is true)
3. Is the p-value considered low or not?
- Threshold should depend on the context
- Typical thresholds: 0.1, 0.05, 0.01
4. If the p-value is below the threshold, 2 possible conclusions:
- A rare event just happened, or
- Our assumption in Step 1 was false

Hypothesis test: coin flipping example

I flip a coin 20 times and it comes up heads 16 times. Is my coin biased?

1. Start with a null hypothesis: Probability of heads \(p = 0.5\)
2. Based on this assumption, how likely were we to collect data as extreme as what we have?
- “As extreme”: 16 or more heads, or 4 or fewer heads
- Probability of collecting data at least as extreme as ours: 0.0118
3. Is the p-value considered low or not?
4. If the p-value is below the threshold, 2 possible conclusions:
- A rare event just happened, or
- Our assumption in Step 1 was false
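
The p-value of 0.0118 can be reproduced in R. The sketch below computes it in two ways: by adding up the binomial probabilities of the "as extreme" outcomes directly, and with the built-in exact test `binom.test`. The variable names are my own.

```r
n_flips <- 20
n_heads <- 16

# Outcomes at least as extreme as 16 heads when p = 0.5:
# 16 or more heads, or 4 or fewer heads
extreme <- c(0:4, 16:20)
p_manual <- sum(dbinom(extreme, size = n_flips, prob = 0.5))
p_manual

# The built-in exact binomial test reports the same two-sided p-value
p_builtin <- binom.test(n_heads, n_flips, p = 0.5)$p.value
p_builtin
```

Both values round to 0.0118, which is below the 0.05 threshold but above 0.01, so the conclusion depends on the threshold chosen in advance.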

Tempo by mode: Is there a difference?

Two options:

\(t\)-test
- Null hypothesis: Mean tempo for songs in minor key is the same as that for songs in major key
- Makes more assumptions on the data generation process (“parametric test”)
Kolmogorov-Smirnov test
- Null hypothesis: The distribution of tempo for songs in minor key is the same as that for songs in major key
- Fewer assumptions on the data generation process (“non-parametric test”), but rejecting the null gives less information
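
Both tests are available in base R as `t.test` and `ks.test`. The sketch below runs them on simulated tempos, since the Spotify data is not reproduced here; the sample sizes, standard deviation, and seed are arbitrary assumptions, so the p-values it prints say nothing about the real songs.

```r
set.seed(32)  # arbitrary seed so the simulated data are reproducible

# Simulated tempos (assumed normal); not the real Spotify data
songs <- data.frame(
  mode  = rep(c("Major", "Minor"), times = c(60, 40)),
  tempo = c(rnorm(60, mean = 122, sd = 28), rnorm(40, mean = 116, sd = 28))
)

# t-test: null hypothesis is that both modes have the same mean tempo
# (by default R runs Welch's version, which allows unequal variances)
t_result <- t.test(tempo ~ mode, data = songs)
t_result$p.value

# Kolmogorov-Smirnov test: null hypothesis is that both modes have
# the same distribution of tempo
major_tempo <- songs$tempo[songs$mode == "Major"]
minor_tempo <- songs$tempo[songs$mode == "Minor"]
ks_result <- ks.test(major_tempo, minor_tempo)
ks_result$p.value
```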

What is a model?

A model is a simplified and idealized way to understand a system.
R4DS: “The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true ‘signals’ (i.e. patterns generated by the phenomenon of interest), and ignore ‘noise’ (i.e. random variation that you’re not interested in).”

Two steps to modeling

Step 1: Identify a family of models which express a generic pattern between your variables of interest.

Possible model family: Linear model, i.e. \(child = a_1 + a_2 \times parent\).

Variables: child and parent
Model parameters: \(a_1\) and \(a_2\)

Many other possible models: linear without intercept, quadratic, exponential, …

Different models within the linear model family

Each line corresponds to a choice of \(a_1\) and \(a_2\).

Two steps to modeling

Step 2: Find the model in this family that most closely matches your data.

That is, find specific values of \(a_1\) and \(a_2\) which make the model match the data most closely.

What do we mean by “closely matching the data”?

We choose \(a_1\) and \(a_2\) such that some objective function (loss function) is minimized.

Most common objective: Minimize the sum of squared residuals, i.e. the squared vertical distances between each data point and the line (drawn as black lines in the figure).

(Source: uc-r.github.io)
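
To make the objective concrete, the sketch below evaluates the sum of squares for a few candidate lines \(child = a_1 + a_2 \times parent\). The heights are made-up numbers, not the data behind the slides. For a straight line, the minimizing \(a_1\) and \(a_2\) have a well-known closed form, which is used here in place of a numerical search.

```r
# Made-up heights in inches, for illustration only
parent <- c(64, 66, 67, 68, 70, 71, 72)
child  <- c(65, 66, 68, 67, 69, 70, 70)

# Loss function: sum of squared vertical distances between the
# observed child heights and the line a1 + a2 * parent
sum_sq <- function(a1, a2) {
  predicted <- a1 + a2 * parent
  sum((child - predicted)^2)
}

# Two arbitrary members of the linear model family
sum_sq(a1 = 0, a2 = 1)
sum_sq(a1 = 30, a2 = 0.5)

# Least-squares solution for a straight line (closed form)
a2_best <- cov(parent, child) / var(parent)
a1_best <- mean(child) - a2_best * mean(parent)
sum_sq(a1_best, a2_best)  # no other line in the family gives a smaller value
```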

Linear models in R

Linear regression can be done with the lm function
Syntax: lm(formula, data = df)
Formulas look like y ~ x, which lm will translate to a function like \(y = a_1 + a_2 \cdot x\)
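
A minimal example of this syntax, on a small made-up data frame (the values of `x` and `y` are assumptions for illustration):

```r
# Made-up data, roughly following y = 2 * x
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = df)

coef(fit)     # "(Intercept)" is a_1, "x" is a_2
summary(fit)  # standard errors, p-values, R-squared

# Predict y for a new value of x using the fitted line
predict(fit, newdata = data.frame(x = 6))
```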

Models with categorical variables

Consider modeling valence ~ mode.

Does the model \(valence = a_1 + a_2 \cdot mode\) make sense?
- 3 + 4 \(\cdot\) “Major”??
What R does:
- Choose a baseline category (say, “Minor”)
- Model \(valence = a_1 + a_2 \cdot modeMajor\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
\(valence = a_1\) if Minor, \(valence = a_1 + a_2\) if Major
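
The 0/1 coding can be inspected with `model.matrix`. The sketch below uses made-up valence values. Note that R takes the first factor level as the baseline, and for text columns the levels are alphabetical by default, which would make “Major” the baseline; the levels are set explicitly here so that “Minor” is the baseline, as on the slide.

```r
# Made-up data; levels are ordered so that "Minor" is the baseline
songs <- data.frame(
  mode    = factor(c("Major", "Major", "Major", "Minor", "Minor", "Minor"),
                   levels = c("Minor", "Major")),
  valence = c(0.6, 0.7, 0.5, 0.4, 0.3, 0.5)
)

# The modeMajor column is 1 for Major songs and 0 for Minor songs
model.matrix(~ mode, data = songs)

fit <- lm(valence ~ mode, data = songs)
coef(fit)  # "(Intercept)" is a_1, "modeMajor" is a_2

# a_1 is the mean valence of Minor songs; a_1 + a_2 is the mean for Major
tapply(songs$valence, songs$mode, mean)
```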

Additive models

Formula valence ~ loudness + mode translates to

\(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
\(valence = a_1 + a_2 \cdot loudness\) if Minor
\(valence = (a_1 + a_3) + a_2 \cdot loudness\) if Major
Same gradient, different intercept
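
A sketch of the additive model on made-up data (the loudness and valence values are assumptions, and “Minor” is set as the baseline level to match the equations above). The two lines are then read off from the three fitted coefficients.

```r
# Made-up data, for illustration only
songs <- data.frame(
  mode     = factor(rep(c("Minor", "Major"), each = 4),
                    levels = c("Minor", "Major")),
  loudness = c(-9, -7, -6, -4, -8, -6, -5, -3),
  valence  = c(0.30, 0.38, 0.45, 0.52, 0.42, 0.55, 0.58, 0.70)
)

fit_add <- lm(valence ~ loudness + mode, data = songs)
a <- coef(fit_add)
a  # a_1 = "(Intercept)", a_2 = "loudness", a_3 = "modeMajor"

# One line per mode: the slope is shared, only the intercept shifts
minor_line <- c(intercept = a[["(Intercept)"]], slope = a[["loudness"]])
major_line <- c(intercept = a[["(Intercept)"]] + a[["modeMajor"]],
                slope = a[["loudness"]])
minor_line
major_line
```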

Models with interaction

Formula valence ~ loudness * mode translates to

\(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor + \color{blue}{a_4 \cdot loudness \cdot modeMajor}\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
\(valence = a_1 + a_2 \cdot loudness\) if Minor
\(valence = (a_1 + a_3) + (a_2 + a_4) \cdot loudness\) if Major
Different gradient, different intercept
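
A sketch of the interaction model on made-up data (values are assumptions; “Minor” is set as the baseline level). With a full interaction, each mode gets its own line, so the Major slope \(a_2 + a_4\) should match what you get from fitting the Major songs on their own; the last lines check this.

```r
# Made-up data, for illustration only
songs <- data.frame(
  mode     = factor(rep(c("Minor", "Major"), each = 4),
                    levels = c("Minor", "Major")),
  loudness = c(-9, -7, -6, -4, -8, -6, -5, -3),
  valence  = c(0.30, 0.38, 0.45, 0.52, 0.42, 0.55, 0.58, 0.70)
)

fit_int <- lm(valence ~ loudness * mode, data = songs)
a <- coef(fit_int)
names(a)  # includes the extra term "loudness:modeMajor" (a_4)

# Slope for each mode
minor_slope <- a[["loudness"]]
major_slope <- a[["loudness"]] + a[["loudness:modeMajor"]]

# Check: fitting the Major songs alone gives the same slope
fit_major <- lm(valence ~ loudness, data = subset(songs, mode == "Major"))
c(from_interaction = major_slope, major_only = coef(fit_major)[["loudness"]])
```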

Summary of the course

Variable types
Basic objects in R (vectors, lists, data frames)
Plotting data with ggplot2
Transforming and joining data with dplyr and tidyr
Importing and exporting data
Working with factors using forcats
R scripts and R markdown
Making maps
Basic statistical testing and modeling

Where do we go from here?

Read R for Data Science from cover to cover!
Go through the programming exercises and solutions
Take short courses on DataCamp
Writing your own functions and running simulations
Interactive plots with plotly
Advanced mapping with ggmap
Predictive models/machine learning with caret
Interactive web apps with shiny
Text analysis with tidytext
- Recommended text: Text Mining with R by Julia Silge and David Robinson (available online for free at tidytextmining.com)

Other Stanford courses

Programming: CS 106A
Statistical methods: STATS 60, STATS 101
Data challenge lab: ENGR 150

Thank you! :)